Statistical inference from a single sample of data
Statistical inference from multiple samples of data
Sampling distributions of sample estimates are (often) determined using the Central Limit Theorem
Standard error as an estimate for the standard deviation of a sampling distribution
Margin of error
Critical values corresponding to particular confidence levels
Confidence intervals provide an (interval) estimate for the effect of interest. Hence, confidence intervals, and not hypothesis tests, can inform us about the effect size. This makes it easier for us to compare the results of our statistical analysis to a practical understanding of what kind of “effect” would be important in a particular problem setting.
Practical significance is not determined by statistical significance! Statistical significance is not determined by practical significance!
Hypotheses are typically designed so that what we want to prove is expressed in the alternative. For all of the methods that we’ve covered thus far, the null hypothesis is always going to be of the form \[H_0: \text{<parameter> } = \text{ some number}\]
The only way to reduce both types of error is to collect more evidence or, in statistical terms, to collect more data.
\(\alpha = Pr(\text{Type I error})\): If \(H_0\) is true, this is the probability that we (incorrectly) reject it.
\(\beta = Pr(\text{Type II error})\): If \(H_0\) is false, this is the probability that we (incorrectly) fail to reject it.
\(1-\beta = \text{Power}\): If \(H_0\) is false, this is the probability that we (correctly) reject it.
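As a rough illustration of how collecting more data affects these error rates, the sketch below uses R's built-in `power.t.test()` to compute the power of a two-sample t-test at a fixed \(\alpha = 0.05\) for several sample sizes. The effect size (`delta = 1`), the standard deviation (`sd = 1`), and the sample sizes are hypothetical values chosen only for illustration; they do not come from any example in these notes.

```r
# Hypothetical scenario: the true difference in means is 1 unit and the
# standard deviation within each group is 1 (illustrative values only).
sample_sizes <- c(10, 20, 40)  # per-group sample sizes to compare

# Power (1 - beta) of a two-sided, two-sample t-test at alpha = 0.05
power_by_n <- sapply(sample_sizes, function(n) {
  power.t.test(n = n, delta = 1, sd = 1, sig.level = 0.05)$power
})

# beta = Pr(Type II error) = 1 - power
data.frame(n = sample_sizes, power = power_by_n, beta = 1 - power_by_n)
```

With \(\alpha\) held fixed, the power column grows and the \(\beta\) column shrinks as the per-group sample size increases.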
The logic of hypothesis tests is similar to the logic behind inter-universe travel in the movie Everything Everywhere All at Once…
Two independent groups
Two paired groups of data
On average, how much more money do consumers spend at Target compared to Walmart?
Suppose researchers collected a systematic sample from \(85\) Walmart customers and \(80\) Target customers by asking them for their purchase amount as they left the stores. The data they collected is summarized in the table below. Suppose a computer already calculated the degrees of freedom to be \(162.75\).
| | Walmart | Target |
|---|---|---|
| \(\bar{x}\) | \(\$45\) | \(\$53\) |
| \(s\) | \(\$21\) | \(\$19\) |
Step 1) Identify and define the population parameter and choose your confidence level.
Step 2) Calculate the sample estimate for the population parameter.
Step 3) Assess the required assumptions and conditions.
Step 4) Find the critical value corresponding to your confidence level.
Step 5) Calculate the standard error of your sample estimate.
Step 6) Calculate the lower and upper bounds of your confidence interval.
On average, how large is the difference in car insurance prices for customers of an online insurance company versus customers of a local insurance company?
Find a \(95\%\) confidence interval for the mean difference in insurance prices based on the data given below. The data below represents randomly selected insurance profiles (type of car, coverage, driving record, etc.) for 10 clients at a local provider and the corresponding quote from another online provider given their policy information.
mean(insurance_diff$PriceDiff)
## [1] 45.9
sd(insurance_diff$PriceDiff)
## [1] 175.6628
Week 14 - new statistical method (our last one for the semester)
Week 14 and 15 - begin discussions on ethical statistical practice
Week 15 - Friday in-class poster presentation
You and your group mates are welcome to attend anytime between 9:30am-10:30am or 11:00am-12:30pm.
Plan to spend at least 45 minutes in class and come early to hang up your poster in the room. (Prof Suzy will provide hanging supplies.)
Prof Suzy will take turns meeting with each group for 5-7 minutes where you will present your topic.
All participants will be asked to take some time to read the other posters. Each person will need to submit a 3-4 sentence summary on another group’s project in order to get credit for attendance that day.
Step 1) \(\mu_1 - \mu_2 =\) mean amount spent at Target minus mean amount spent at Walmart. We’ll use a 95% confidence level.
Step 2) \(\bar{x}_1 - \bar{x}_2 = 8\)
Step 3) Assess the required assumptions and conditions - done in class.
Step 4) We need the critical \(t^*\) value corresponding to a 0.95 confidence level from a Student’s t distribution with \(162.75\) degrees of freedom. We can find this exactly using R and this value should be similar to the approximate critical value which you can read off the t-table.
qt(0.025, df = 162.75, lower.tail=TRUE)
## [1] -1.974647
Step 5) \(SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{19^2}{80} + \frac{21^2}{85}} = 3.115\)
Step 6) \(8 \pm \left(1.975 \times 3.115\right) = [\$1.848, \$14.152]\), with interpretation given in class.
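The arithmetic in Steps 2 through 6 can be sketched in R directly from the summary statistics in the table. This is a minimal sketch, assuming the degrees of freedom come from the Welch-Satterthwaite approximation (which reproduces the \(162.75\) quoted in the problem); the variable names are illustrative only.

```r
# Summary statistics from the table (group 1 = Target, group 2 = Walmart)
xbar1 <- 53; s1 <- 19; n1 <- 80
xbar2 <- 45; s2 <- 21; n2 <- 85

# Step 2: sample estimate of mu_1 - mu_2
estimate <- xbar1 - xbar2

# Step 5: standard error of the difference in sample means
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se <- sqrt(v1 + v2)

# Degrees of freedom (assumed: Welch-Satterthwaite approximation)
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Step 4: critical value for a 95% confidence level (upper tail, so positive)
t_star <- qt(0.975, df = df)

# Step 6: lower and upper bounds of the confidence interval
ci <- estimate + c(-1, 1) * t_star * se

round(c(estimate = estimate, se = se, df = df, t_star = t_star), 3)
round(ci, 3)
```

Because R carries full precision for \(t^*\) and the standard error, the bounds can differ from the hand calculation in the third decimal place.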
Step 1) Identify and define the population parameter and choose your confidence level.
\(\mu_{Diff} =\) the mean difference in insurance prices between online and local providers (local minus online)
Let’s use a 90% confidence level to mix things up.
Step 2) Calculate the sample estimate for the population parameter.
\(\bar{d} = \$45.9\)
Step 3) Assess the required assumptions and conditions.
Independence
10% condition
Randomization condition
Sample size (or nearly Normal) condition
The data is representative of the local insurance company because these
10 profiles were randomly selected. It’s not clear how large the local
insurance company is but it’s pretty likely that the company has more
than 100 customers. Therefore, there isn’t any strong indicator that the
difference data is not independent. (I.e. we can assume within sample
independence.) However, the sample size is rather small so in order to
use the CLT, we need to check a histogram of the difference data. This
histogram is symmetric and unimodal so it seems reasonable that the
larger population of all possible differences between prices for
customers of this local company is approximately Normally distributed.
There aren’t any major red flags against any of the necessary
assumptions for this method.
Step 4) Find the critical value corresponding to your confidence level.
\(t^*_{0.90, df=10-1}=1.833\) (note this is also the value you’d find using the t-table)
## [1] -1.833113
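The printed value is negative because it is a lower-tail quantile; the critical value \(t^*\) is its absolute value. The call that produced this output is not shown, so the sketch below is an assumption about how such a value can be obtained with `qt()` for a 90% confidence level and \(10 - 1 = 9\) degrees of freedom.

```r
conf_level <- 0.90
df <- 10 - 1
alpha <- 1 - conf_level      # total area in the two tails

# Lower-tail quantile: negative, cuts off alpha/2 = 0.05 in the left tail
qt(alpha / 2, df = df)

# Upper-tail quantile: same magnitude, positive; this is t*
t_star <- qt(1 - alpha / 2, df = df)
t_star
```

The t distribution is symmetric about zero, so either tail gives the same critical value up to sign.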
Step 5) Calculate the standard error of your sample estimate.
\(SE(\bar{d}) = \frac{175.66}{\sqrt{10}} = 55.549\)
Step 6) Calculate the lower and upper bounds of your confidence interval.
\(45.9 \pm \left(1.833 \times 55.549 \right) = [-55.928,147.728]\)
Thus, we are \(90\%\) confident that the true mean difference in insurance prices between this online and this local provider (local minus online) is between -$55.93 and $147.73. In other words, on average, the local provider is anywhere from $55.93 cheaper to $147.73 more expensive than the online provider.
We can check our answer in R using the following code:
##
## One Sample t-test
##
## data: insurance_diff$PriceDiff
## t = 0.82629, df = 9, p-value = 0.43
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## -55.92845 147.72845
## sample estimates:
## mean of x
## 45.9